IBM HR Employee Attrition¶
Context¶
Organizations invest heavily in employee development, satisfaction, and retention. However, high attrition rates can lead to significant costs — including lost productivity, recruitment efforts, and onboarding. The ability to predict employee attrition can help HR departments take proactive steps to retain valuable talent.
This project uses IBM’s HR Analytics dataset, which contains detailed information about employees’ roles, compensation, performance, satisfaction, work environment, and more.
Objective¶
- To identify the different factors that drive attrition
- To build a model to predict if an employee will attrite or not
Dataset Description¶
| Column | Description |
|---|---|
| Age | Age of the employee |
| Attrition | Whether the employee left the company (Yes/No) |
| BusinessTravel | Frequency of business travel |
| DailyRate | Daily salary rate |
| Department | Department the employee belongs to |
| DistanceFromHome | Distance from employee's home to workplace |
| Education | Education level (1–5) |
| EducationField | Field of education (e.g., Life Sciences, Marketing) |
| EnvironmentSatisfaction | Satisfaction with work environment (1–4) |
| Gender | Employee gender |
| HourlyRate | Hourly wage |
| JobInvolvement | Level of job involvement (1–4) |
| JobLevel | Employee job level (1–5) |
| JobRole | Specific job title |
| JobSatisfaction | Satisfaction with the job (1–4) |
| MaritalStatus | Marital status |
| MonthlyIncome | Monthly salary |
| MonthlyRate | Monthly rate |
| NumCompaniesWorked | Number of companies the employee worked at previously |
| OverTime | Whether employee works overtime (Yes/No) |
| PercentSalaryHike | Percentage salary increase |
| PerformanceRating | Performance rating (1–4) |
| RelationshipSatisfaction | Satisfaction with relationships (1–4) |
| StockOptionLevel | Stock option level |
| TotalWorkingYears | Total years of professional experience |
| TrainingTimesLastYear | Times participated in training in last year |
| WorkLifeBalance | Work-life balance rating (1–4) |
| YearsAtCompany | Years spent at the company |
| YearsInCurrentRole | Years in the current role |
| YearsSinceLastPromotion | Years since last promotion |
| YearsWithCurrManager | Years under current manager |
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Importing Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
sns.set()
Loading the dataset¶
# Reading the dataset
df = pd.read_csv('/content/drive/MyDrive/My DS DA/Employee Attrition/data.csv')
df.head()
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
Data Overview¶
Info¶
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1470 entries, 0 to 1469 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 1470 non-null int64 1 Attrition 1470 non-null object 2 BusinessTravel 1470 non-null object 3 DailyRate 1470 non-null int64 4 Department 1470 non-null object 5 DistanceFromHome 1470 non-null int64 6 Education 1470 non-null int64 7 EducationField 1470 non-null object 8 EmployeeCount 1470 non-null int64 9 EmployeeNumber 1470 non-null int64 10 EnvironmentSatisfaction 1470 non-null int64 11 Gender 1470 non-null object 12 HourlyRate 1470 non-null int64 13 JobInvolvement 1470 non-null int64 14 JobLevel 1470 non-null int64 15 JobRole 1470 non-null object 16 JobSatisfaction 1470 non-null int64 17 MaritalStatus 1470 non-null object 18 MonthlyIncome 1470 non-null int64 19 MonthlyRate 1470 non-null int64 20 NumCompaniesWorked 1470 non-null int64 21 Over18 1470 non-null object 22 OverTime 1470 non-null object 23 PercentSalaryHike 1470 non-null int64 24 PerformanceRating 1470 non-null int64 25 RelationshipSatisfaction 1470 non-null int64 26 StandardHours 1470 non-null int64 27 StockOptionLevel 1470 non-null int64 28 TotalWorkingYears 1470 non-null int64 29 TrainingTimesLastYear 1470 non-null int64 30 WorkLifeBalance 1470 non-null int64 31 YearsAtCompany 1470 non-null int64 32 YearsInCurrentRole 1470 non-null int64 33 YearsSinceLastPromotion 1470 non-null int64 34 YearsWithCurrManager 1470 non-null int64 dtypes: int64(26), object(9) memory usage: 402.1+ KB
Total Records: 1,470 employees
Total Features: 35 columns
Missing Data: None (all columns have 1,470 non-null values)
Numeric (int64): 26 columns (e.g., Age, MonthlyIncome, TotalWorkingYears)
Categorical (object): 9 columns (e.g., Gender, Department, JobRole)
Target Variable: Attrition (Yes/No)
Identifier: EmployeeNumber (unique ID, likely not useful for modeling)
Encoding Needed: categorical columns must be encoded before modeling: Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime
Unique Values¶
df.nunique()
| 0 | |
|---|---|
| Age | 43 |
| Attrition | 2 |
| BusinessTravel | 3 |
| DailyRate | 886 |
| Department | 3 |
| DistanceFromHome | 29 |
| Education | 5 |
| EducationField | 6 |
| EmployeeCount | 1 |
| EmployeeNumber | 1470 |
| EnvironmentSatisfaction | 4 |
| Gender | 2 |
| HourlyRate | 71 |
| JobInvolvement | 4 |
| JobLevel | 5 |
| JobRole | 9 |
| JobSatisfaction | 4 |
| MaritalStatus | 3 |
| MonthlyIncome | 1349 |
| MonthlyRate | 1427 |
| NumCompaniesWorked | 10 |
| Over18 | 1 |
| OverTime | 2 |
| PercentSalaryHike | 15 |
| PerformanceRating | 2 |
| RelationshipSatisfaction | 4 |
| StandardHours | 1 |
| StockOptionLevel | 4 |
| TotalWorkingYears | 40 |
| TrainingTimesLastYear | 7 |
| WorkLifeBalance | 4 |
| YearsAtCompany | 37 |
| YearsInCurrentRole | 19 |
| YearsSinceLastPromotion | 16 |
| YearsWithCurrManager | 18 |
Total Records: 1,470
Total Features: 35
No missing values in any column
Data Types: 26 numerical, 9 categorical
Attrition: Binary classification with 2 unique values (Yes/No)
- Age: 43 unique values, likely continuous
- Gender: 2 values (Male/Female); MaritalStatus: 3 values
- Over18: only 1 value → not useful for modeling
- Education: 5 levels; EducationField: 6 distinct fields
- JobRole: 9 roles; Department: 3 departments; JobLevel: 5 levels
- Satisfaction metrics (JobSatisfaction, JobInvolvement, EnvironmentSatisfaction, RelationshipSatisfaction): 4 levels each
- Tenure and experience: YearsAtCompany: 37, YearsInCurrentRole: 19, YearsSinceLastPromotion: 16, YearsWithCurrManager: 18, NumCompaniesWorked: 10, TotalWorkingYears: 40, TrainingTimesLastYear: 7
- Pay: MonthlyIncome: 1,349 unique values (high cardinality), MonthlyRate: 1,427 (near unique), HourlyRate: 71, DailyRate: 886
- PercentSalaryHike: 15 values; StockOptionLevel: 4 levels; BusinessTravel: 3 levels; DistanceFromHome: 29 values
- OverTime: 2 values (Yes/No); WorkLifeBalance: 4 levels
- EmployeeCount, Over18, StandardHours: constant → can be dropped
- EmployeeNumber: unique identifier → drop for modeling
df.columns.to_list()
['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
- Observations
- Drop columns:
  - EmployeeNumber: identifier, unique for each employee
  - Over18: has only 1 unique value
  - StandardHours: has only 1 unique value
  - EmployeeCount: constant
Dropping the columns
df = df.drop(['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'], axis = 1)
# Creating numerical columns
num_cols = ['Age',
'DailyRate',
'DistanceFromHome',
'Education',
'EnvironmentSatisfaction',
'HourlyRate',
'JobInvolvement',
'JobLevel',
'JobSatisfaction',
'MonthlyIncome',
'MonthlyRate',
'NumCompaniesWorked',
'PercentSalaryHike',
'PerformanceRating',
'RelationshipSatisfaction',
'StockOptionLevel',
'TotalWorkingYears',
'TrainingTimesLastYear',
'WorkLifeBalance',
'YearsAtCompany',
'YearsInCurrentRole',
'YearsSinceLastPromotion',
'YearsWithCurrManager']
# Creating categorical variables
cat_cols = ['Attrition',
'BusinessTravel',
'Department',
'EducationField',
'Gender',
'JobRole',
'MaritalStatus',
'OverTime']
df.isnull().sum().sum()
np.int64(0)
df.duplicated().sum()
np.int64(0)
Exploratory Data Analysis¶
Univariate analysis of numerical columns¶
# Checking summary statistics
df[num_cols].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.0 | 36.923810 | 9.135373 | 18.0 | 30.0 | 36.0 | 43.00 | 60.0 |
| DailyRate | 1470.0 | 802.485714 | 403.509100 | 102.0 | 465.0 | 802.0 | 1157.00 | 1499.0 |
| DistanceFromHome | 1470.0 | 9.192517 | 8.106864 | 1.0 | 2.0 | 7.0 | 14.00 | 29.0 |
| Education | 1470.0 | 2.912925 | 1.024165 | 1.0 | 2.0 | 3.0 | 4.00 | 5.0 |
| EnvironmentSatisfaction | 1470.0 | 2.721769 | 1.093082 | 1.0 | 2.0 | 3.0 | 4.00 | 4.0 |
| HourlyRate | 1470.0 | 65.891156 | 20.329428 | 30.0 | 48.0 | 66.0 | 83.75 | 100.0 |
| JobInvolvement | 1470.0 | 2.729932 | 0.711561 | 1.0 | 2.0 | 3.0 | 3.00 | 4.0 |
| JobLevel | 1470.0 | 2.063946 | 1.106940 | 1.0 | 1.0 | 2.0 | 3.00 | 5.0 |
| JobSatisfaction | 1470.0 | 2.728571 | 1.102846 | 1.0 | 2.0 | 3.0 | 4.00 | 4.0 |
| MonthlyIncome | 1470.0 | 6502.931293 | 4707.956783 | 1009.0 | 2911.0 | 4919.0 | 8379.00 | 19999.0 |
| MonthlyRate | 1470.0 | 14313.103401 | 7117.786044 | 2094.0 | 8047.0 | 14235.5 | 20461.50 | 26999.0 |
| NumCompaniesWorked | 1470.0 | 2.693197 | 2.498009 | 0.0 | 1.0 | 2.0 | 4.00 | 9.0 |
| PercentSalaryHike | 1470.0 | 15.209524 | 3.659938 | 11.0 | 12.0 | 14.0 | 18.00 | 25.0 |
| PerformanceRating | 1470.0 | 3.153741 | 0.360824 | 3.0 | 3.0 | 3.0 | 3.00 | 4.0 |
| RelationshipSatisfaction | 1470.0 | 2.712245 | 1.081209 | 1.0 | 2.0 | 3.0 | 4.00 | 4.0 |
| StockOptionLevel | 1470.0 | 0.793878 | 0.852077 | 0.0 | 0.0 | 1.0 | 1.00 | 3.0 |
| TotalWorkingYears | 1470.0 | 11.279592 | 7.780782 | 0.0 | 6.0 | 10.0 | 15.00 | 40.0 |
| TrainingTimesLastYear | 1470.0 | 2.799320 | 1.289271 | 0.0 | 2.0 | 3.0 | 3.00 | 6.0 |
| WorkLifeBalance | 1470.0 | 2.761224 | 0.706476 | 1.0 | 2.0 | 3.0 | 3.00 | 4.0 |
| YearsAtCompany | 1470.0 | 7.008163 | 6.126525 | 0.0 | 3.0 | 5.0 | 9.00 | 40.0 |
| YearsInCurrentRole | 1470.0 | 4.229252 | 3.623137 | 0.0 | 2.0 | 3.0 | 7.00 | 18.0 |
| YearsSinceLastPromotion | 1470.0 | 2.187755 | 3.222430 | 0.0 | 0.0 | 1.0 | 3.00 | 15.0 |
| YearsWithCurrManager | 1470.0 | 4.123129 | 3.568136 | 0.0 | 2.0 | 3.0 | 7.00 | 17.0 |
- Observations
  - All features have complete data (count = 1470).
  - MonthlyIncome and MonthlyRate are highly skewed → consider a log transform.
  - PerformanceRating and StockOptionLevel show low variance → may be dropped.
  - YearsAtCompany, YearsSinceLastPromotion, and TotalWorkingYears span wide ranges → consider binning.
  - Satisfaction and involvement scores (EnvironmentSatisfaction, JobSatisfaction, etc.) are ordinal → treat accordingly.
  - NumCompaniesWorked = 0 may indicate a first job → consider encoding.
# Creating histograms
df[num_cols].hist(figsize = (14,14))
plt.tight_layout()
- Observations
  - Right-skewed features: DistanceFromHome, MonthlyIncome, MonthlyRate, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager → consider a log transform or binning.
  - Low-variance features: PerformanceRating (mostly 3 or 4) and StockOptionLevel (mostly 0 or 1) → may offer little predictive value.
  - Balanced or near-normal distributions: Age, HourlyRate, and DailyRate show a fairly even spread.
  - Ordinal categorical features: JobLevel, Education, WorkLifeBalance, JobInvolvement, and the satisfaction scores show discrete peaks → treat as ordinal, not continuous.
  - Notable clustering: NumCompaniesWorked has a spike at 0 → may indicate first-time employees; TrainingTimesLastYear and PercentSalaryHike are centered on specific values.
  - Potential engineered features: grouping the Years... or income-related columns into bins may enhance model performance.
Univariate analysis for categorical variables¶
for i in cat_cols:
print(df[i].value_counts(normalize = True))
print('*' * 40)
Attrition No 0.838776 Yes 0.161224 Name: proportion, dtype: float64 **************************************** BusinessTravel Travel_Rarely 0.709524 Travel_Frequently 0.188435 Non-Travel 0.102041 Name: proportion, dtype: float64 **************************************** Department Research & Development 0.653741 Sales 0.303401 Human Resources 0.042857 Name: proportion, dtype: float64 **************************************** EducationField Life Sciences 0.412245 Medical 0.315646 Marketing 0.108163 Technical Degree 0.089796 Other 0.055782 Human Resources 0.018367 Name: proportion, dtype: float64 **************************************** Gender Male 0.6 Female 0.4 Name: proportion, dtype: float64 **************************************** JobRole Sales Executive 0.221769 Research Scientist 0.198639 Laboratory Technician 0.176190 Manufacturing Director 0.098639 Healthcare Representative 0.089116 Manager 0.069388 Sales Representative 0.056463 Research Director 0.054422 Human Resources 0.035374 Name: proportion, dtype: float64 **************************************** MaritalStatus Married 0.457823 Single 0.319728 Divorced 0.222449 Name: proportion, dtype: float64 **************************************** OverTime No 0.717007 Yes 0.282993 Name: proportion, dtype: float64 ****************************************
- Observations
- Attrition: Highly imbalanced (Yes = 16.1%, No = 83.9%) → requires stratification or class weighting in modeling.
- BusinessTravel: Majority travel rarely (70.9%); frequent travelers (18.8%) may correlate with higher attrition.
- Department: Most employees are in R&D (65.4%), with fewer in Sales (30.3%) and HR (4.3%).
- EducationField: Life Sciences (41.2%) and Medical (31.6%) dominate; Human Resources is rare (~1.8%).
- Gender: Male-dominated (60%); may require fairness checks in modeling.
- JobRole: Spread out, but Sales Executives and Research Scientists are the most common; Human Resources is only 3.5%.
- MaritalStatus: Married is the largest group (45.8%), followed by Single (32%) and Divorced (22.2%).
- OverTime: 28.3% of employees work overtime, a factor often linked to attrition.
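The class weighting mentioned above can be computed directly from the class counts. A sketch using scikit-learn's 'balanced' heuristic on labels mirroring the dataset's 1233/237 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels mirroring the dataset's imbalance: 1233 "No" (0) vs 237 "Yes" (1)
y = np.array([0] * 1233 + [1] * 237)

# 'balanced' assigns weight n_samples / (n_classes * count(class))
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority "Yes" class gets the larger weight
```

These weights can also be requested implicitly by passing class_weight='balanced' to estimators such as LogisticRegression.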
Bivariate and Multivariate analysis¶
# Crosstab of OverTime vs Attrition, normalized by row
pd.crosstab(df['OverTime'], df['Attrition'], normalize = 'index')
| Attrition | No | Yes |
|---|---|---|
| OverTime | ||
| No | 0.895636 | 0.104364 |
| Yes | 0.694712 | 0.305288 |
# How many employees do not work overtime and have not left the company
df[(df['OverTime'] == 'No') & (df['Attrition'] == 'No')].shape
(944, 31)
# Employees who do not work overtime but still left the company
df[(df['OverTime'] == 'No') & (df['Attrition'] == 'Yes')].shape
(110, 31)
# Proportion of employees who do not work overtime, but still left the company
110/(944 + 110)
0.10436432637571158
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 10), sharey=True)
i = 0
for x in range(2):
for y in range(4):
if i < len(cat_cols):
pd.crosstab(df[cat_cols[i]], df['Attrition'], normalize='index')\
.mul(100).plot(kind='bar', stacked=True, ax=axes[x, y])
axes[x, y].set_ylabel('Percentage Attrition %')
axes[x, y].set_title(cat_cols[i])
i += 1
else:
axes[x, y].axis('off') # Hide empty subplot
plt.tight_layout()
plt.show()
- Observations
- OverTime: Strongest indicator — employees working overtime have much higher attrition.
- BusinessTravel: Frequent travelers are more likely to leave than those who travel rarely or not at all.
- JobRole: Sales Representatives and Laboratory Technicians show higher attrition; Managers and Directors show lower.
- Department: Sales has higher attrition compared to R&D and HR.
- MaritalStatus: Single employees are more likely to leave than married or divorced ones.
- EducationField: Attrition appears fairly consistent across fields, with minor differences.
- Gender: Slightly higher attrition among females, though the gap is small.
Relationship between attrition and Numerical variables
# The mean of numerical variables grouped by attrition
df.groupby(['Attrition'])[num_cols].mean()
| Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition | |||||||||||||||||||||
| No | 37.561233 | 812.504461 | 8.915653 | 2.927007 | 2.771290 | 65.952149 | 2.770479 | 2.145985 | 2.778589 | 6832.739659 | ... | 3.153285 | 2.733982 | 0.845093 | 11.862936 | 2.832928 | 2.781022 | 7.369019 | 4.484185 | 2.234388 | 4.367397 |
| Yes | 33.607595 | 750.362869 | 10.632911 | 2.839662 | 2.464135 | 65.573840 | 2.518987 | 1.637131 | 2.468354 | 4787.092827 | ... | 3.156118 | 2.599156 | 0.527426 | 8.244726 | 2.624473 | 2.658228 | 5.130802 | 2.902954 | 1.945148 | 2.852321 |
2 rows × 23 columns
- Observations
- Age: Employees who left are younger (33.6 vs 37.6).
- MonthlyIncome: Those who left earned significantly less (≈ $4.8k vs $6.8k).
- JobLevel: Leavers tend to be in lower-level positions (1.64 vs 2.15).
- TotalWorkingYears: Lower average experience for leavers (8.2 vs 11.9).
- YearsWithCurrManager: Much lower for leavers (2.85 vs 4.37), possibly indicating weak leadership ties.
- YearsInCurrentRole: Leavers had shorter tenure in role (2.9 vs 4.5).
- JobInvolvement & Satisfaction Scores: Lower across the board for leavers (e.g., JobSatisfaction: 2.47 vs 2.78).
- EnvironmentSatisfaction: Lower among those who left (2.46 vs 2.77).
- DistanceFromHome: Higher for leavers (10.6 vs 8.9) — long commutes may affect attrition.
- StockOptionLevel: Higher for those who stayed (0.85 vs 0.53) — equity may help retain talent.
- These patterns suggest which kinds of employees are leaving the company most often.
Relationship between different numerical variables
# Plotting the correlation between numerical variables
plt.figure(figsize = (15, 8))
mask = np.triu(df[num_cols].corr())
sns.heatmap(df[num_cols].corr(), annot = True, fmt = '0.2f', cmap = 'YlGnBu', mask=mask)
<Axes: >
- Observations
  - Strong positive correlations: MonthlyIncome vs JobLevel (0.95), JobLevel vs TotalWorkingYears (0.78), YearsAtCompany vs YearsInCurrentRole (0.76), YearsWithCurrManager vs YearsInCurrentRole (0.71), TotalWorkingYears vs Age (0.68)
  - Moderate positive correlations: MonthlyIncome vs TotalWorkingYears (0.62)
  - Low or negligible correlations: DailyRate, DistanceFromHome, HourlyRate, and PerformanceRating show weak or no significant correlation with other features; TrainingTimesLastYear and WorkLifeBalance have almost no correlation with other variables.
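A common follow-up is to flag pairs above a correlation threshold as drop candidates. A sketch on a small synthetic frame (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = rng.integers(0, 40, 200)
df_num = pd.DataFrame({
    'TotalWorkingYears': years,
    'MonthlyIncome': years * 500 + rng.normal(0, 300, 200),  # strongly tied to experience
    'DailyRate': rng.integers(102, 1499, 200),               # unrelated
})

# Keep only the upper triangle so each pair is counted once
corr = df_num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Any pair above 0.9 is a candidate for dropping one of the two columns
high_pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.9]
print(high_pairs)
```

On the real data this would flag pairs such as MonthlyIncome/JobLevel (0.95) for review.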
Summary of EDA¶
Observations
Data Description
- The dataset contains HR data on 1,470 employees with 35 features, including:
  - Target: Attrition (Yes/No)
  - Numerical: Age, MonthlyIncome, YearsAtCompany, etc.
  - Categorical: JobRole, Department, Gender, BusinessTravel, etc.
  - Covers job satisfaction, performance, compensation, and tenure metrics.
Data Cleaning
- Dropped constant or ID-like columns: EmployeeCount, EmployeeNumber, Over18, StandardHours
- Verified no missing values (count = 1470 for all columns).
- Identified class imbalance: only ~16% of employees left (Attrition == Yes).
- Separated features into:
  - Numerical (23)
  - Categorical (7, plus the target Attrition)
Observations from EDA
- MonthlyIncome, MonthlyRate, TotalWorkingYears, and DistanceFromHome are right-skewed.
- PerformanceRating and StockOptionLevel show low variance.
- Age and HourlyRate are evenly distributed.
- JobLevel, Education, and the satisfaction scores are ordinal and discrete.
Model Building - Approach¶
- Prepare the data for modeling.
- Partition the data into train and test sets.
- Build the model on the train data.
- Tune the model if required.
- Test the data on the test set.
Preparing data for modeling¶
# Two ways to map 'Yes'/'No' to 1/0 in a pandas DataFrame
# Method 1
# df['OverTime'].replace({'Yes': 1, 'No': 0})
# Method 2
np.where(df['OverTime'] == 'Yes', 1, 0)
# np.where checks, for each value in the OverTime column, whether it equals 'Yes';
# if True it assigns 1, otherwise 0
array([1, 0, 1, ..., 1, 0, 0])
# Creating dummy variables for categorical Variables
# Creating the list of columns for which we need to create the dummy variables
to_get_dummies_for = ['BusinessTravel', 'Department', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement',
'JobLevel', 'JobRole', 'MaritalStatus']
# Creating dummy variables
df = pd.get_dummies(data = df, columns = to_get_dummies_for, drop_first = True)
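To see what drop_first=True does, here is a toy frame with the three BusinessTravel categories (the output column names follow pandas' alphabetical category ordering):

```python
import pandas as pd

toy = pd.DataFrame({'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']})

# drop_first=True keeps k-1 indicators per category; the alphabetically first
# level (Non-Travel) becomes the implicit baseline encoded as all zeros
print(pd.get_dummies(toy, columns=['BusinessTravel'], drop_first=True))
```

Dropping one level avoids a redundant column, since the baseline is fully determined by the remaining indicators.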
# Mapping overtime and attrition
dict_OverTime = {'Yes': 1, 'No': 0}  # maps the categorical string values 'Yes'/'No' to the numeric values 1/0
dict_attrition = {'Yes': 1, 'No': 0}
df['OverTime'] = df.OverTime.map(dict_OverTime) # maps the 'OverTime' column's values ('Yes' to 1, 'No' to 0).
df['Attrition'] = df.Attrition.map(dict_attrition)
# Separating the independent variables (X) and the dependent variable (Y)
Y = df.Attrition  # target variable
X = df.drop(columns = ['Attrition'])  # independent variables
X.shape  # full feature matrix before the train/test split
(1470, 54)
Scaling Options and Ranking¶
Below, we compare ways of scaling the numerical variables so that they share a comparable range. Without scaling, the model may give higher weight to variables with a larger magnitude and under-learn from variables whose range is smaller but whose percentage changes are just as significant. There are many scaling methods; rather than committing to one up front, the cell below tries a set of scalers and transformers, ranks them, and reports outlier and skewness diagnostics. For more information on the different methods, refer to section 6.3 of the scikit-learn preprocessing user guide.
data = df.copy()
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
PowerTransformer, QuantileTransformer, Normalizer
)
# 📌 Load dataset (Modify path if needed)
#file_path = "your_dataset.csv" # Update with correct file
#data = pd.read_csv(file_path)
# 📌 Select numeric columns only
numeric_features = data.select_dtypes(include=['number']).dropna()
# 📌 Detect Outliers using IQR (Interquartile Range)
def detect_outliers(df1):
outlier_info = {}
for column in df1.columns:
Q1 = df1[column].quantile(0.25)
Q3 = df1[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Count outliers
outlier_count = ((df1[column] < lower_bound) | (df1[column] > upper_bound)).sum()
outlier_info[column] = outlier_count
return outlier_info
outliers = detect_outliers(numeric_features)
outlier_df1 = pd.DataFrame.from_dict(outliers, orient='index', columns=['Outlier Count'])
# 📌 Check Skewness (to determine if transformation is needed)
skewness = numeric_features.skew()
# 📌 Scaling Options with All Variants
scalers = {
"StandardScaler": StandardScaler(),
"MinMaxScaler": MinMaxScaler(),
"RobustScaler": RobustScaler(),
"MaxAbsScaler": MaxAbsScaler(),
"PowerTransformer (Yeo-Johnson)": PowerTransformer(method="yeo-johnson"),
"PowerTransformer (Box-Cox)": PowerTransformer(method="box-cox"),
"QuantileTransformer (Normal)": QuantileTransformer(output_distribution="normal"),
"QuantileTransformer (Uniform)": QuantileTransformer(output_distribution="uniform"),
"Normalizer (L1)": Normalizer(norm="l1"),
"Normalizer (L2)": Normalizer(norm="l2"),
"Normalizer (Max)": Normalizer(norm="max"),
}
scaler_results = {}
# 📌 Try all scalers & transformations
for scaler_name, scaler in scalers.items():
try:
transformed_data = scaler.fit_transform(numeric_features)
transformed_df1 = pd.DataFrame(transformed_data, columns=numeric_features.columns, index=numeric_features.index)
# Calculate variance after scaling (Higher is better)
cumulative_variance = transformed_df1.var().sum()
scaler_results[scaler_name] = cumulative_variance
except Exception as e:
print(f"❌ Error with {scaler_name}: {e}")
# 📌 Convert results to DataFrame and Rank
scaler_rank_df1 = pd.DataFrame(list(scaler_results.items()), columns=['Scaler', 'Cumulative Variance'])
scaler_rank_df1 = scaler_rank_df1.sort_values(by="Cumulative Variance", ascending=False)
# 📌 Pick the best scaler based on variance ranking
best_scaler_name = scaler_rank_df1.iloc[0]["Scaler"]
best_scaler = scalers[best_scaler_name]
scaled_data = best_scaler.fit_transform(numeric_features)
# 📌 Convert Scaled Data to DataFrame
scaled_df1 = pd.DataFrame(scaled_data, columns=numeric_features.columns, index=numeric_features.index)
# 📌 Save the scaled dataset
scaled_df1.to_csv("/content/drive/MyDrive/My DS DA/Employee Attrition/scaled_dataset.csv", index=False)
# 📌 Explain Selection
explanations = {
"MinMaxScaler": "Scales data between [0,1]. Used in Unsupervised Learning & Deep Learning but NOT recommended for outliers.",
"StandardScaler": "Standardizes data to mean=0, variance=1. A common default for many models, but sensitive to outliers.",
"RobustScaler": "Uses median and IQR. Best choice for handling outliers in both Unsupervised and Supervised Learning.",
"MaxAbsScaler": "Scales data between [-1,1]. Used for sparse data, not generally needed for typical datasets.",
"PowerTransformer (Yeo-Johnson)": "Transforms data to be more Gaussian-like. Good for skewed data in Regression & Classification.",
"PowerTransformer (Box-Cox)": "Similar to Yeo-Johnson but works only with strictly positive values. Reduces skewness.",
"QuantileTransformer (Normal)": "Maps data to a normal distribution. Works well when distribution is unknown.",
"QuantileTransformer (Uniform)": "Maps data to a uniform distribution. Useful for highly irregular distributions.",
"Normalizer (L1)": "Normalizes each row by L1 norm. Good for text-based or sparse datasets.",
"Normalizer (L2)": "Normalizes each row by L2 norm. Often used in clustering tasks.",
"Normalizer (Max)": "Normalizes each row by its maximum absolute value. Good for text and sparse data."
}
# 📌 Print Summary
print("\n📊 Scaler Rankings:")
print(scaler_rank_df1)
print(f"\n🏆 Best Scaler Chosen: {best_scaler_name}")
print(f"📌 Reason: {explanations.get(best_scaler_name, 'No explanation available')}")
# 📌 Print Insights
if outlier_df1['Outlier Count'].sum() > 0:
print("\n⚠️ Outliers detected! **RobustScaler** is recommended if handling them is critical.")
else:
print("\n✅ No significant outliers detected. **StandardScaler** is the default recommendation.")
if any(abs(skewness) > 1):
print("\n⚠️ Skewed data detected! **PowerTransformer**")
❌ Error with PowerTransformer (Box-Cox): The Box-Cox transformation can only be applied to strictly positive data
📊 Scaler Rankings:
Scaler Cumulative Variance
5 QuantileTransformer (Normal) 135.841283
0 StandardScaler 21.014295
4 PowerTransformer (Yeo-Johnson) 20.013615
2 RobustScaler 11.679481
6 QuantileTransformer (Uniform) 2.197743
1 MinMaxScaler 1.695346
3 MaxAbsScaler 1.284149
9 Normalizer (Max) 0.135088
8 Normalizer (L2) 0.098395
7 Normalizer (L1) 0.060545
🏆 Best Scaler Chosen: QuantileTransformer (Normal)
📌 Reason: Maps data to a normal distribution. Works well when distribution is unknown.
⚠️ Outliers detected! **RobustScaler** is recommended if handling them is critical.
⚠️ Skewed data detected! **PowerTransformer**
Scaling the data¶
The independent variables in this dataset have different scales. When features have different scales from each other, there is a chance that a higher weightage will be given to features that have a higher magnitude, and they will dominate over other features whose magnitude changes may be smaller but whose percentage changes may be just as significant or even larger. This will impact the performance of our machine learning algorithm, and we do not want our algorithm to be biased towards one feature.
The solution to this issue is Feature Scaling, i.e. scaling the dataset so as to give every transformed variable a comparable scale.
Based on the ranking above, we will use the QuantileTransformer with a normal output distribution, which maps each feature onto a Gaussian-like distribution.
The more conventional alternative, the Standard Scaler (z-score), standardizes features by subtracting the mean and scaling to unit variance; it is kept below, commented out, for comparison.
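As a sanity check, the z-score used by StandardScaler can be reproduced by hand on a few toy Age values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[18.0], [30.0], [36.0], [43.0], [60.0]])  # toy Age values

scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()  # np.std defaults to ddof=0, matching sklearn

print(np.allclose(scaled, manual))  # True
```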
# Alternative: scaling the data with StandardScaler (z-score)
# sc = StandardScaler()
# X_scaled = sc.fit_transform(X)
# X_scaled = pd.DataFrame(X_scaled, columns = X.columns)
from sklearn.preprocessing import QuantileTransformer
# Scaling the data
sc = QuantileTransformer(output_distribution='normal')
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
Splitting the data into 70% train and 30% test sets¶
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases, it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, test_size = 0.3, random_state = 1, stratify = Y)
stratify=Y
- The data is split so that the proportion of each class in the target variable Y is preserved in both the training and test sets.
- This is particularly useful when the target variable is imbalanced (i.e., one class has many more instances than the other).
Check for Imbalanced Data¶
sns.countplot(data=df, x='Attrition', edgecolor = "black");
# Number of samples in the target variable. Original data
sum(Y == 0), sum(Y == 1)
(1233, 237)
# Proportion of class "0"
round(1233/(1233 + 237) * 100, 2)
83.88
# Number of samples in the target variable. Training data after split using "stratify"
sum(y_train == 0), sum(y_train == 1)
(863, 166)
# Proportion of class "0"
round(863/(863+166) * 100, 2)
83.87
# Number of samples in the target variable. Training data after split without using "stratify"
_, _, y_train_no_strat, y_test_no_strat = train_test_split(X_scaled, Y, test_size = 0.3, random_state=1)
sum(y_train_no_strat == 0), sum(y_train_no_strat == 1)
(869, 160)
# Proportion of class "0"
round(869/(869+160) * 100, 2)
84.45
Note that train_test_split does not stratify by default (stratify=None), which is why it is passed explicitly above. sklearn's cross-validation utilities (e.g., cross_val_score and GridSearchCV), on the other hand, default to stratified folds for classifiers, so relative class frequencies are approximately preserved in each fold.
Model evaluation criterion¶
Model Evaluation Notes – Choosing the Right Metric
Project Goal: Predict which employees are likely to leave (attrition = "Yes") to enable early intervention and improve retention.
Recommended Evaluation Metric: Recall (for Attrition = Yes)
- Definition: Recall = True Positives / (True Positives + False Negatives)
- Reason: Measures how many actual leavers the model successfully identifies.
- Why it's important:
- Missing a potential leaver (false negative) is costly for HR and business.
- Helps proactively retain valuable employees before they exit.
Secondary Metric: F1-Score (for Attrition = Yes)
- Definition: F1 = 2 × (Precision × Recall) / (Precision + Recall)
- Use when: You want to balance between:
- Catching attrition (high recall)
- Avoiding too many false alerts (precision)
Avoid using accuracy alone:
- Dataset is imbalanced (Attrition = Yes is only ~16%)
- High accuracy can be misleading if the model mostly predicts "No"
Summary:
- Primary focus: Recall (Yes) — don't miss actual leavers
- Secondary: F1-score (Yes) — balance between false positives and false negatives
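A small worked example of these formulas, using hypothetical confusion-matrix counts for the Attrition = Yes class:

```python
def recall(tp, fn):
    """Fraction of actual leavers the model caught."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of flagged employees who actually left."""
    return tp / (tp + fp)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts for illustration
tp, fp, fn = 45, 30, 26
r = recall(tp, fn)
p = precision(tp, fp)
score = f1(p, r)
```

With these counts, recall is 45/71 ≈ 0.63, precision is 45/75 = 0.60, and F1 lands between them at ≈ 0.62.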
The model can make two types of wrong predictions:
- Predicting an employee will attrite when the employee doesn't attrite (FP)
- Predicting an employee will not attrite when the employee actually attrites (FN)
- False negatives are the costlier error for the company: a missed leaver means lost talent and no chance to intervene, so reducing FN (and understanding why people leave) is the priority.
Which case is more important?
- Predicting that the employee will not attrite but the employee attrites, i.e., losing out on a valuable employee or asset. This would be considered a major miss for any employee attrition predictor and is hence the more important case of wrong predictions.
How to reduce this loss i.e the need to reduce False Negatives?
The company would want Recall to be maximized: the greater the Recall, the lower the number of false negatives. Hence, the focus should be on increasing Recall, i.e., identifying the true positives (Class 1) well, so that the company can provide incentives to control the attrition rate, especially for top performers. This helps optimize the overall cost of retaining the best talent.
Recall (Sensitivity / TP Rate)
- Use case: when missing a positive instance is more critical (e.g., identifying fraud, diagnosing disease), i.e., when you want to minimize false negatives (FN) by maximizing recall.
Also, let's create a function to calculate and print the classification report and the confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
Recall ranges from 0 to 1.
- Values closer to 1 are better; for example, a model with a recall of 0.98 catches almost all actual positives.
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
# Confusion matrix
cm = confusion_matrix(actual, predicted)
# Plot confusion matrix
plt.figure(figsize = (7, 4))
sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Attrite', 'Attrite'], yticklabels = ['Not Attrite', 'Attrite'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predicting using the independent variables
pred = model.predict(predictors)
recall = recall_score(target, pred, average = 'macro') # To compute recall
precision = precision_score(target, pred, average = 'macro') # To compute precision
acc = accuracy_score(target, pred) # To compute accuracy score
# Creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Precision": precision,
"Recall": recall,
"Accuracy": acc,
},
index = [0],
)
return df_perf
Building the model¶
Models
- Decision Tree
- Random Forest
Building a Decision Tree Model¶
from sklearn.tree import DecisionTreeClassifier
# Building decision tree model
dt_no_weights = DecisionTreeClassifier(random_state = 1)
dt_weights = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1) # give more importance to the minority class
# class_weight is useful for imbalanced data (see the countplot of Attrition above):
# label 0 ("Not Attrite") is ~83.88% and label 1 ("Attrite") ~16.12% of the data, so the
# proportions are inverted and the larger weight (0.83) goes to the minority class (1).
# The weights could be tuned, but deriving them from the class distribution is a reasonable default.
For supervised classification tasks with unbalanced/imbalanced data, common ways to account for class imbalance include:
- Resampling Techniques
- Oversampling Minority Class
- Undersampling Majority Class
- Combination of Both
- Class Weight Adjustment
- Adjust the class_weight parameter to give more importance to the minority class (the approach used here)
- Data Augmentation Techniques
- Rotation
- Flipping
- Scaling
- Cropping
- Use of Different Metrics
- F1-Score
- Precision-Recall Curve
- ROC-AUC Score
- Confusion Matrix
- Ensemble Learning Methods
- Random Forest
- XGBoost Parameter scale_pos_weight
- Threshold Tuning
- Stratified Sampling
- Custom Loss Functions
- Weighted Cross-Entropy
- Focal Loss
- Advanced Techniques
- Cost-Sensitive Learning
- Two-Stage Models
- Anomaly Detection Approaches
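As an illustration of the first technique, here is a minimal random-oversampling sketch in pure Python (libraries like imbalanced-learn provide production versions of this):

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=1):
    """Duplicate random minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_rows = [s for s, l in zip(samples, labels) if l == minority]
    need = counts[majority] - counts[minority]
    extra = [rng.choice(minority_rows) for _ in range(need)]
    return samples + extra, labels + [minority] * need

X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2          # 8-vs-2 toy imbalance
X_bal, y_bal = oversample_minority(X_toy, y_toy)
# y_bal now contains 8 samples of each class
```

Oversampling should be applied only to the training split, never before the train/test split, or the test set leaks duplicated rows.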
# # Fitting decision tree model
# dt.fit(x_train, y_train)
# Fitting decision tree model
dt_no_weights.fit(x_train, y_train)
DecisionTreeClassifier(random_state=1)
# Fitting decision tree model
dt_weights.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
Let's check the model performance of the decision tree.
# Checking performance on the training dataset
y_train_pred_dt = dt_weights.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 1.00 1.00 1.00 863
1 1.00 1.00 1.00 166
accuracy 1.00 1029
macro avg 1.00 1.00 1.00 1029
weighted avg 1.00 1.00 1.00 1029
- Observations
- The decision tree model shows perfect performance on the training set with 100% precision, recall, and F1-score for both classes.
- The confusion matrix confirms no misclassifications. This is a strong indicator of overfitting, as real-world data rarely allows for such ideal separation.
- Evaluation on the test set is needed to assess generalizability.
macro refers to the method used to average the metric across multiple classes in a multi-class classification problem
Weighted Averaging
- weights each class by its support (number of samples in the class)
- Gives more importance to larger classes.
Macro Averaging
- Calculates the metric (e.g., precision, recall) independently for each class and then takes the average of these values.
- Useful when you want to treat all classes equally, even if some classes have fewer samples.
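Using the per-class recalls and supports from the decision tree's test-set report above (0.85 for class 0 with support 370, 0.41 for class 1 with support 71), the two averaging schemes work out as:

```python
# Per-class recall and support from the test-set classification report
recalls = {0: 0.85, 1: 0.41}
supports = {0: 370, 1: 71}

# Macro: plain average, every class counts equally
macro_recall = sum(recalls.values()) / len(recalls)

# Weighted: average weighted by class support, favoring the majority class
weighted_recall = sum(recalls[c] * supports[c] for c in recalls) / sum(supports.values())
```

Rounded to two decimals these give 0.63 and 0.78, matching the macro avg and weighted avg rows of the report.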
# Checking performance on the test dataset
y_test_pred_dt = dt_weights.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.88 0.85 0.86 370
1 0.34 0.41 0.37 71
accuracy 0.78 441
macro avg 0.61 0.63 0.62 441
weighted avg 0.79 0.78 0.78 441
Observations – Decision Tree (Test Set)
- On the test set, the model's performance drops significantly, especially for the minority class.
- Recall for class 1 (Attrite) is just 0.41, and precision is 0.34, indicating a high number of false positives.
- While overall accuracy is 78%, the model struggles with correctly identifying attrition cases. This confirms overfitting observed in the training set.
Decision Tree Model – Train vs Test Comparison (with Class Weights)
| Metric | Train Set | Test Set |
|---|---|---|
| Accuracy | 1.00 | 0.78 |
| Precision (Attrite) | 1.00 | 0.34 |
| Recall (Attrite) | 1.00 | 0.41 |
| F1-score (Attrite) | 1.00 | 0.37 |
| Macro Avg F1 | 1.00 | 0.62 |
| Weighted Avg F1 | 1.00 | 0.78 |
Observations
- The Decision Tree model with class weights achieves perfect performance on the training data, indicating overfitting.
- On the test set, the recall for the minority class (Attrite) is only 0.41, suggesting that the model fails to generalize well.
- While class weighting improves sensitivity to attrition, the drop in test set metrics reveals poor generalization, making the model unreliable in production without pruning or regularization.
# performance_classification with training data
model_performance_classification(dt_weights, x_train, y_train)
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
# dtree_test = model_performance_classification(dt,x_test,y_test)
dtree_test = model_performance_classification(dt_weights,x_test,y_test)
dtree_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.60945 | 0.627198 | 0.77551 |
Let's plot the feature importance and check the most important features.
importances = dt_weights.feature_importances_
importances
array([3.79936593e-02, 6.82162389e-02, 8.10343094e-02, 5.09446909e-02,
4.39879277e-02, 1.01489495e-01, 6.73985155e-02, 1.13790091e-02,
6.06921082e-02, 2.03138490e-02, 1.08217662e-15, 2.01208561e-02,
8.23041683e-02, 3.20681971e-02, 1.67164389e-02, 1.22094076e-02,
6.53640377e-02, 1.70946813e-02, 1.57082809e-02, 1.43828341e-02,
1.97231699e-16, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
7.67886455e-03, 0.00000000e+00, 2.19606906e-03, 0.00000000e+00,
1.72890644e-02, 0.00000000e+00, 1.16721790e-17, 7.71928139e-03,
9.95701923e-03, 2.54090092e-02, 2.19240018e-02, 1.38341564e-02,
1.26714609e-02, 1.05208205e-02, 0.00000000e+00, 1.96886075e-03,
0.00000000e+00, 5.80786172e-03, 1.67413487e-03, 0.00000000e+00,
3.81392623e-03, 8.66106155e-03, 0.00000000e+00, 3.62580696e-03,
3.45506258e-03, 1.16862326e-02, 3.47014779e-04, 1.03415854e-02,
0.00000000e+00, 0.00000000e+00])
# Plot the feature importance
importances = dt_weights.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance,y=importance_df.index, palette = 'rocket');
Feature Importance – Decision Tree Classifier (with Class Weights)
- MonthlyIncome is the most influential feature, indicating compensation is strongly linked to attrition risk.
- StockOptionLevel and DistanceFromHome are also major contributors; employees with fewer stock options or longer commutes are more likely to leave.
- Compensation-related features like DailyRate, MonthlyRate, and HourlyRate appear consistently high, emphasizing financial influence.
- OverTime and YearsAtCompany show that workload and tenure play meaningful roles in predicting attrition.
- JobSatisfaction and Age contribute moderately, suggesting satisfaction and career stage matter.
- Work-life balance factors such as WorkLifeBalance, TrainingTimesLastYear, and YearsSinceLastPromotion have noticeable but lower influence.
- Categorical variables (e.g., JobRole, EducationField, Gender) appear with lower importance, meaning role- or gender-specific patterns are less critical in this model.
- Overall, financial incentives, commute, and experience-related variables dominate the model's decision-making.
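To read off the top drivers programmatically, the importances (which sum to 1) can simply be sorted; a sketch using a few of the scores from the array above (rounded):

```python
# A few importance scores read from feature_importances_ above (rounded)
importances = {
    'MonthlyIncome': 0.101,
    'StockOptionLevel': 0.082,
    'DistanceFromHome': 0.081,
    'DailyRate': 0.068,
}

# Rank features from most to least influential
top = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
# top[0] is the single most influential feature
```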
Let's try to tune the model and check if we could improve the results.
Tuning the Decision Tree Classifier using GridSearch¶
- Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
- Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
- It is an exhaustive search that is performed on the specific parameter values of a model.
- The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
criterion {“gini”, “entropy”}
- The function used to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
- The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
- The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
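Under the hood, grid search exhaustively enumerates every combination of the grid; a sketch of that enumeration for the same parameter grid used below:

```python
from itertools import product

parameters = {
    'max_depth': [2, 3, 4, 5, 6],          # np.arange(2, 7) in the cell below
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [5, 10, 20, 25],
}

# Every candidate combination GridSearchCV will evaluate:
# 5 * 2 * 4 = 40, each scored with cv-fold cross-validation
combos = [dict(zip(parameters, values)) for values in product(*parameters.values())]
```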
# from sklearn import metrics
# %%time
# # Choose the type of classifier
# dtree_tuned = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# # Grid of parameters to choose from
# parameters = {'max_depth': np.arange(2, 7), # should test a wide range of values 2, 15, 50, 100
# 'criterion': ['gini', 'entropy'],
# 'min_samples_leaf': [5, 10, 20, 25]}
# # Type of scoring used to compare parameter combinations.
# # "pos_label": It allows you to specify which class label should be considered as the positive class when calculating the scoring metric.
# # By default, pos_label = 1, but you can change it based on your specific use case.
# scorer = metrics.make_scorer(recall_score, pos_label = 1)
# # Grid search object
# gridCV = GridSearchCV(dtree_tuned, parameters, scoring = scorer, cv = 10) # cross validation, GridSearchCV to tune hyperparameters
# # Fitting the grid search on the train data
# gridCV = gridCV.fit(x_train, y_train)
# # Set the classifier to the best combination of parameters
# dtree_tuned = gridCV.best_estimator_
# # Fit the best estimator to the data
# dtree_tuned.fit(x_train, y_train)
from sklearn import metrics
from sklearn.metrics import recall_score
# Choose the type of classifier
dtree_tuned = DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
# Grid of parameters to choose from
parameters = {
'max_depth': np.arange(2, 7),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label=1)
# Grid search object
gridCV = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=10)
# Fit the grid search on the train data
gridCV = gridCV.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
dtree_tuned = gridCV.best_estimator_
# Fit the best estimator to the data
dtree_tuned.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, max_depth=np.int64(2),
                       min_samples_leaf=5, random_state=1)
# Show attributes of the fitted estimator
gridCV.best_estimator_.__dict__
{'criterion': 'gini',
'splitter': 'best',
'max_depth': np.int64(2),
'min_samples_split': 2,
'min_samples_leaf': 5,
'min_weight_fraction_leaf': 0.0,
'max_features': None,
'max_leaf_nodes': None,
'random_state': 1,
'min_impurity_decrease': 0.0,
'class_weight': {0: 0.17, 1: 0.83},
'ccp_alpha': 0.0,
'monotonic_cst': None,
'feature_names_in_': array(['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate',
'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate',
'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager',
'BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely',
'Department_Research & Development', 'Department_Sales',
'Education_2', 'Education_3', 'Education_4', 'Education_5',
'EducationField_Life Sciences', 'EducationField_Marketing',
'EducationField_Medical', 'EducationField_Other',
'EducationField_Technical Degree', 'EnvironmentSatisfaction_2',
'EnvironmentSatisfaction_3', 'EnvironmentSatisfaction_4',
'Gender_Male', 'JobInvolvement_2', 'JobInvolvement_3',
'JobInvolvement_4', 'JobLevel_2', 'JobLevel_3', 'JobLevel_4',
'JobLevel_5', 'JobRole_Human Resources',
'JobRole_Laboratory Technician', 'JobRole_Manager',
'JobRole_Manufacturing Director', 'JobRole_Research Director',
'JobRole_Research Scientist', 'JobRole_Sales Executive',
'JobRole_Sales Representative', 'MaritalStatus_Married',
'MaritalStatus_Single'], dtype=object),
'n_features_in_': 54,
'n_outputs_': 1,
'classes_': array([0, 1]),
'n_classes_': np.int64(2),
'max_features_': 54,
'tree_': <sklearn.tree._tree.Tree at 0x7d97502b30c0>}
# Checking performance on the TRAINING DATASET
y_train_pred_dt = dtree_tuned.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 0.91 0.63 0.74 863
1 0.25 0.66 0.37 166
accuracy 0.63 1029
macro avg 0.58 0.64 0.56 1029
weighted avg 0.80 0.63 0.68 1029
Decision Tree (GridSearchCV Tuned) – Training Set Observations
- The grid search selected a Decision Tree with max_depth=2, min_samples_leaf=5, criterion='gini', and class weights favoring the minority class (Attrite).
- Training recall for the attrition class (1) improved to 0.66, indicating better sensitivity to identifying employees at risk of leaving.
- However, precision for class 1 dropped significantly to 0.25, reflecting a higher false positive rate.
- The confusion matrix shows 109 true positives and 57 false negatives for attrition, but also 319 false positives, reflecting a strong bias toward flagging class 1.
- Overall training accuracy dropped to 0.63: the tuned model trades precision for recall, which is useful for early risk flagging in retention strategies.
# Checking performance on the TEST SET
y_test_pred_dt = dtree_tuned.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.89 0.57 0.70 370
1 0.22 0.63 0.33 71
accuracy 0.58 441
macro avg 0.56 0.60 0.51 441
weighted avg 0.78 0.58 0.64 441
Comparison: Decision Tree (Default Weights) vs GridSearchCV-Tuned Decision Tree
| Metric | Default DT (Train) | Default DT (Test) | Tuned DT (Train) | Tuned DT (Test) |
|---|---|---|---|---|
| Accuracy | 1.00 | 0.78 | 0.63 | 0.58 |
| Precision (Attrite) | 1.00 | 0.34 | 0.25 | 0.22 |
| Recall (Attrite) | 1.00 | 0.41 | 0.66 | 0.63 |
| F1-score (Attrite) | 1.00 | 0.37 | 0.37 | 0.33 |
| Macro Avg F1 | 1.00 | 0.62 | 0.56 | 0.51 |
| Weighted Avg F1 | 1.00 | 0.78 | 0.68 | 0.64 |
Observations
- The default decision tree severely overfits the training data with perfect precision, recall, and accuracy (all 1.00), but fails to generalize well on test data.
- Its recall on test set is 0.41, meaning it misses nearly 60% of actual attrition cases.
- Tuned decision tree via GridSearchCV significantly improves recall on both train (0.66) and test (0.63), indicating better sensitivity to the positive class (Attrite).
- However, this gain in recall comes at the cost of precision (0.22 test), indicating the model makes more false positive predictions.
- The tuned model generalizes better, with a smaller performance gap between train and test sets.
- F1-score for attrition is similar between models on test data (0.37 default vs. 0.33 tuned), but the tuned model better balances recall against overfitting.
Recommendation
For HR use cases where identifying potential attrition early is more critical than avoiding false positives, the GridSearchCV-tuned decision tree is preferred due to its improved recall and generalizability. Precision can be improved further with post-modeling steps like threshold tuning or ensemble methods.
from sklearn.metrics import precision_score, recall_score, accuracy_score
# TEST DATA
dtree_tuned_test = model_performance_classification(dtree_tuned,x_test,y_test)
dtree_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.556216 | 0.603388 | 0.582766 |
Output metrics BEFORE tuning the model
temp = pd.concat([model_performance_classification(dt_weights, x_train, y_train),
model_performance_classification(dt_weights, x_test, y_test)], axis=0)
temp.index = ['Training dataset', 'Test dataset']
temp
| Precision | Recall | Accuracy | |
|---|---|---|---|
| Training dataset | 1.00000 | 1.000000 | 1.00000 |
| Test dataset | 0.60945 | 0.627198 | 0.77551 |
Output metrics AFTER tuning the model. This model is not overfitting the training data
temp = pd.concat([model_performance_classification(dtree_tuned, x_train, y_train),
model_performance_classification(dtree_tuned, x_test, y_test)], axis=0)
temp.index = ['Training dataset', 'Test dataset']
temp
| Precision | Recall | Accuracy | |
|---|---|---|---|
| Training dataset | 0.579915 | 0.643493 | 0.634597 |
| Test dataset | 0.556216 | 0.603388 | 0.582766 |
Let's look at the feature importance of this model and try to analyze why this is happening.
# Feature importance
importances = dtree_tuned.feature_importances_
# Rename columns
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
# Plot
plt.figure(figsize = (13, 13))
sns.barplot(x= importance_df.Importance, y= importance_df.index);
Feature Importance – Tuned Decision Tree
- The top contributing feature in the tuned decision tree model is StockOptionLevel, followed closely by OverTime, both of which have significantly higher importance scores than other features.
- DistanceFromHome, HourlyRate, and Age are also notable contributors, though with smaller impact.
- The high importance of StockOptionLevel and OverTime aligns with previous SHAP and coefficient analyses, reinforcing their relevance in attrition prediction.
- MonthlyIncome, JobSatisfaction, and NumCompaniesWorked have moderate influence, consistent with HR intuition about employee satisfaction and engagement.
- This insight helps HR teams prioritize key employee variables when designing retention strategies.
Let's plot the tree and check the assumptions about overtime and income.
As we know, a decision tree keeps growing until its nodes are homogeneous, i.e., each contains only one class. Since the dataset has many features, it would be hard to visualize the whole tree, so we only visualize it up to a limited depth.
- Decision tree of the model without weights (unweighted), plotted for visualization and understanding
from sklearn import tree
features = list(X.columns)
plt.figure(figsize = (15, 10), dpi=300)
# "node_ids": Show the ID number on each node.
# "class_names": Names of each of the target classes in ascending numerical order
tree.plot_tree(dt_no_weights, max_depth = 2, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Decision Tree Root Node Summary
The root node of the decision tree splits on the feature StockOptionLevel, with the condition StockOptionLevel <= -2.431. This node contains all 1029 training samples:
- Class 0 (Not Attrite): 863
- Class 1 (Attrite): 166
The Gini impurity at the root is 0.271, which indicates moderate impurity (consistent with the roughly 84/16 class split).
The left branch (True) covers employees with StockOptionLevel <= -2.431 and is further split by:
- YearsAtCompany <= -1.118, then Age <= -0.27 and OverTime <= 0.0
The right branch (False), with higher StockOptionLevel, continues with:
- MonthlyIncome <= -0.884, and further by OverTime and YearsAtCompany
These splits suggest that StockOptionLevel, YearsAtCompany, OverTime, and MonthlyIncome are key decision drivers in predicting attrition.
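The root impurity can be verified directly from the class counts (863 vs. 166):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

root_gini = gini([863, 166])  # class counts at the root node
```

`round(root_gini, 3)` gives 0.271, matching the value shown in the tree plot.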
# Number of training samples
x_train.shape[0]
1029
- Number of samples of each class before split.
# Number of samples of label 0 (No attrition)
y_train[y_train == 0].shape[0]
863
# Number of samples of label 1 (Attrition)
y_train[y_train == 1].shape[0]
166
- In the tree plot, class = y[0] means the predicted label at that node is class 0 (Not Attrite); class = y[1] is class 1 (Attrite).
# Number of samples where feature "OverTime" is lower than or equal to 0.5
x_train[x_train.OverTime <= 0.5].shape[0]
757
# Number of samples where feature "OverTime" is greater than 0.5
x_train[x_train.OverTime > 0.5].shape[0]
272
757 + 272
1029
Decision Tree of model with weights
features = list(X.columns)
plt.figure(figsize = (22, 12), dpi=300)
tree.plot_tree(dt_weights, max_depth = 3, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Observations
- The root node of the decision tree (Node #0) splits on StockOptionLevel <= -2.431, the most important feature for the initial decision.
- The tree is shown to a maximum depth of 3 and makes important splits on features such as OverTime, YearsAtCompany, MonthlyIncome, and DistanceFromHome.
- Left branches (e.g., Node #1 and Node #2) tend to lean toward predicting attrition (class y[1]) when conditions related to lower income or fewer years at the company are met.
- Right branches (e.g., Node #196 and Node #197) reflect higher MonthlyIncome or YearsAtCompany, classifying more samples as non-attrition (class y[0]).
- The model shows how compensation and overtime behavior are strongly linked to attrition likelihood.
- Gini impurity values across nodes range from 0.139 to 0.5, indicating fairly decent class separation at deeper nodes.
- Nodes #2 and #117 show good splits toward predicting attrition, based on Age, JobSatisfaction, and MonthlyIncome.
- The class weights (0: 0.17, 1: 0.83) shift the model's learning toward the minority class (Attrite), as reflected in the structure and outcomes of the tree.
Decision Tree of model with weights and tuned
features = list(X.columns)
plt.figure(figsize = (15, 10), dpi=300)
tree.plot_tree(dtree_tuned, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Observations
- The root node is based on StockOptionLevel <= -2.431, splitting all 1029 samples (863 not attrited, 166 attrited) with a Gini index of 0.5. The class distribution is relatively balanced, leading to an initial split.
- The left branch (True) leads to node #1, which splits on OverTime <= 0.0 and classifies toward attrition (y[1]). This branch contains 428 samples with a better Gini index of 0.469 and a stronger presence of attrition cases.
- The right branch (False) routes to node #4, which also splits on OverTime <= 0.0, but favors the non-attrition class (y[0]), with 601 samples and a Gini of 0.448. This indicates OverTime is a strong recurring splitter.
- Further splits in node #1 reveal:
- Node #2 has a Gini of 0.499 and samples almost evenly split (43.86 vs 48.14), making it an uncertain prediction region.
- Node #3 improves separation with Gini = 0.316 and shows a stronger bias toward attrition (42.33 vs 10.37).
- On the right side (node #4), node #5 has a Gini of 0.387 and maintains clear separation toward non-attrition (69.87 vs 24.9), while node #6 is balanced (Gini = 0.5), showing weaker separation.
- The repeated appearance of StockOptionLevel and OverTime in upper splits confirms their predictive strength, as previously validated in feature importance charts.
Overall, the tree structure shows improved class balance and separation in the tuned model. It reflects meaningful splits aligned with key HR features and supports interpretation for business application.
Random Forest¶
Building the Random Forest Classifier
Random Forest is a bagging algorithm whose base models are Decision Trees. Bootstrap samples are drawn from the training data, and a decision tree is fit on each sample to make a prediction.
The results from all the decision trees are combined, and the final prediction is made by voting (for classification problems) or averaging (for regression problems).
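The bagging-plus-voting idea can be sketched in a few lines of pure Python (illustrative; a real Random Forest also subsamples features at each split):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine per-tree class predictions into one final classification."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(1)
data = list(range(10))
sample = bootstrap_sample(data, rng)    # duplicates are expected
final = majority_vote([1, 0, 1, 1, 0])  # hypothetical votes from 5 trees
```

Each tree in the forest is trained on its own bootstrap sample, which is what decorrelates the trees and reduces the variance of the combined prediction.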
from sklearn.ensemble import RandomForestClassifier
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
rf_estimator.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(x_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 863
1 1.00 1.00 1.00 166
accuracy 1.00 1029
macro avg 1.00 1.00 1.00 1029
weighted avg 1.00 1.00 1.00 1029
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(x_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.85 0.99 0.92 370
1 0.78 0.10 0.17 71
accuracy 0.85 441
macro avg 0.81 0.55 0.55 441
weighted avg 0.84 0.85 0.80 441
Model Comparison Summary
| Model | Dataset | Precision (1) | Recall (1) | F1-Score (1) | Accuracy |
|---|---|---|---|---|---|
| Decision Tree (Weighted) | Train | 1.00 | 1.00 | 1.00 | 1.00 |
| Decision Tree (Weighted) | Test | 0.34 | 0.41 | 0.37 | 0.78 |
| DT Tuned (GridSearchCV) | Train | 0.25 | 0.66 | 0.37 | 0.63 |
| DT Tuned (GridSearchCV) | Test | 0.22 | 0.63 | 0.33 | 0.58 |
| Random Forest (Weighted) | Train | 1.00 | 1.00 | 1.00 | 1.00 |
| Random Forest (Weighted) | Test | 0.78 | 0.10 | 0.17 | 0.85 |
Observations
- The Decision Tree (Weighted) model performs perfectly on the training set but significantly drops in performance on the test set, indicating overfitting.
- The Tuned Decision Tree (via GridSearchCV with recall scoring) sacrifices some training performance to reduce overfitting, but it still struggles on the test set with low precision and F1-score.
- Random Forest performs excellently on training data (again, overfit), but its test performance is poor for the minority class (Recall = 0.10), despite overall accuracy being high due to majority-class dominance.
- The precision-recall trade-off highlights that all models have difficulty generalizing well for minority (Attrite = 1) predictions, common in imbalanced classification problems.
# Checking performance on the TRAINING SET
model_performance_classification(rf_estimator, x_train, y_train)
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
# Checking performance on the TEST SET
rf_estimator_test = model_performance_classification(rf_estimator,x_test,y_test)
rf_estimator_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.814815 | 0.546593 | 0.85034 |
Random Forest Model Performance Observations
- The training performance shows perfect scores: Precision = 1.0, Recall = 1.0, and Accuracy = 1.0, indicating the model has completely overfit the training data.
- On the test set, while the overall accuracy remains high at 85.0%, the macro-averaged recall drops to 0.55 and macro-averaged precision is around 0.81; recall for the attrition class alone is only 0.10.
- This performance gap suggests that the model memorizes the training data rather than generalizing well to unseen examples.
- The slightly improved recall (compared to previous RF results) still isn’t sufficient for sensitive attrition detection, where missing a potential leaver is costly.
Let's check the feature importance of the Random Forest
importances = rf_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x= importance_df.Importance, y=importance_df.index);
Feature Importance from Random Forest
- The Random Forest model identifies MonthlyIncome, MonthlyRate, Age, and DailyRate as the top predictors of employee attrition.
- YearsAtCompany, HourlyRate, and StockOptionLevel also contribute substantially to the model's predictions.
- Several previously observed impactful features like OverTime, JobSatisfaction, and EnvironmentSatisfaction still show importance but with lower weights compared to income-related variables.
- This suggests the Random Forest model prioritizes compensation-related features more heavily than role or engagement indicators when making predictions.
Tuning the Random Forest classifier using GridSearch¶
n_estimators: The number of trees in the forest.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
max_features {“auto”, “sqrt”, “log2”, None}: The number of features to consider when looking for the best split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
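As a quick illustration of how these settings translate into feature counts — assuming a hypothetical n_features = 30, since the real count depends on the encoded x_train matrix:

```python
import math

# How each max_features setting maps to a feature count per split.
# n_features = 30 is an assumed example value, not this dataset's count.
n_features = 30

sqrt_features = int(math.sqrt(n_features))   # "sqrt" (and legacy "auto"): 5
log2_features = int(math.log2(n_features))   # "log2": 4
all_features = n_features                    # None: all 30 features
frac_features = int(0.7 * n_features)        # a float like 0.7 means 70%: 21
```

Smaller values decorrelate the trees (each split sees fewer candidate features), which is the main source of a Random Forest's variance reduction.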
%%time
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Grid of parameters to choose from
params_rf = {"n_estimators": [100, 250, 500],
"min_samples_leaf": np.arange(1, 4, 1),
"max_features": [0.7, 0.9, 'auto']} # fraction of features randomly considered at each split (0.7 = 70%, 0.9 = 90%)
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
CPU times: user 3min 11s, sys: 271 ms, total: 3min 11s Wall time: 3min 34s
rf_estimator_tuned.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, max_features=0.7,
min_samples_leaf=np.int64(3), n_estimators=250,
random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, max_features=0.7,
min_samples_leaf=np.int64(3), n_estimators=250,
random_state=1)# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 1.00 1.00 1.00 863
1 0.99 1.00 1.00 166
accuracy 1.00 1029
macro avg 1.00 1.00 1.00 1029
weighted avg 1.00 1.00 1.00 1029
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.89 0.96 0.92 370
1 0.62 0.37 0.46 71
accuracy 0.86 441
macro avg 0.75 0.66 0.69 441
weighted avg 0.84 0.86 0.85 441
# Checking performance on the TRAINING SET
model_performance_classification(rf_estimator_tuned, x_train, y_train)
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.997006 | 0.999421 | 0.999028 |
# Checking performance on the TEST SET
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, x_test, y_test)
rf_estimator_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.753133 | 0.661477 | 0.861678 |
| Model Type | Dataset | Precision | Recall | Accuracy |
|---|---|---|---|---|
| RF Untuned | Training | 1.000 | 1.000 | 1.000 |
| | Test | 0.814 | 0.547 | 0.850 |
| RF Tuned (GridSearch) | Training | 0.997 | 0.999 | 0.999 |
| | Test | 0.753 | 0.661 | 0.862 |
Observations
- Overfitting is evident in the untuned RF model: it achieves perfect training metrics but significantly lower test recall (54.7%).
- Tuning improved generalization slightly, especially in recall (from 0.547 to 0.661), though it still underperforms in minority class detection.
- Accuracy remains high for both models due to class imbalance, but recall for class 1 (attrition) is more critical for business use cases.
- The tuned model better balances precision and recall on the test set, suggesting it is a more stable option for deployment despite a small drop in precision.
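A complementary lever, not explored in this notebook, is the decision threshold: flagging attrition whenever the predicted probability exceeds a cutoff below the default 0.5 trades precision for recall, which matters when missing a leaver is costly. A sketch on hypothetical predict_proba-style scores (illustrative values only, not model output):

```python
# Lowering the classification threshold to favor recall on the minority class.
proba = [0.92, 0.65, 0.48, 0.35, 0.30, 0.12]  # hypothetical P(attrition)
y_true = [1, 1, 1, 0, 1, 0]

def recall_at(threshold):
    """Recall for class 1 when predicting 1 above the given cutoff."""
    preds = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    return tp / (tp + fn)

# The default cutoff misses borderline leavers; a lower one catches them
default_recall = recall_at(0.5)
lowered_recall = recall_at(0.3)
```

In practice the cutoff would be chosen on a validation set against the business cost of a missed leaver versus an unnecessary retention intervention.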
# Plotting feature importance
importances = rf_estimator_tuned.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x= importance_df.Importance, y= importance_df.index)
<Axes: xlabel='Importance', ylabel='None'>
Observations
- MonthlyIncome, StockOptionLevel, and OverTime are the top drivers of attrition prediction.
- Compensation-related features dominate the top rankings, indicating that financial and reward factors are highly influential.
- YearsAtCompany, DailyRate, and MonthlyRate also play key roles, showing the relevance of tenure and pay structure.
- Behavioral and satisfaction features (e.g., JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance) have moderate to lower impact.
- Business travel and job role variables contribute minimally, suggesting limited influence in the tuned model.
Boosting Models¶
Let's now look at the other kind of ensemble technique, known as boosting.
Understanding Boosting in Machine Learning
The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
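The "correct its predecessor" idea can be made concrete with AdaBoost's reweighting rule: samples the current learner misclassifies get larger weights, so the next learner concentrates on them. A minimal one-round sketch on made-up labels (not the project's data):

```python
import math

# One AdaBoost round on toy data: the weak learner's mistakes get
# up-weighted so the next learner focuses on them.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]              # hypothetical weak learner: one mistake
w = [1 / len(y_true)] * len(y_true)   # start with uniform sample weights

# Weighted error of this learner
err = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t != p)

# Learner's "say" (alpha): lower error -> larger vote in the final ensemble
alpha = 0.5 * math.log((1 - err) / err)

# Up-weight misclassified samples, down-weight correct ones, renormalize
w = [wi * math.exp(alpha if t != p else -alpha)
     for wi, t, p in zip(w, y_true, y_pred)]
total = sum(w)
w = [wi / total for wi in w]
```

After one round the single misclassified sample carries half the total weight, so the next weak learner is strongly pushed to get it right.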
XGBoost¶
- XGBoost stands for Extreme Gradient Boosting.
- XGBoost is a tree-based ensemble machine learning technique that improves prediction power and performance by building on the Gradient Boosting framework and incorporating reliable approximation algorithms. It is widely used and routinely appears at the top of data science competition leaderboards.
# # Installing the xgboost library using the 'pip' command.
# !pip install xgboost
# Importing the AdaBoostClassifier and GradientBoostingClassifier [Boosting]
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# Importing the XGBClassifier from the xgboost library
from xgboost import XGBClassifier
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state = 1)
# Fitting the model
adaboost_model.fit(x_train, y_train)
# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(adaboost_model,x_test,y_test)
adaboost_model_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.808492 | 0.598877 | 0.861678 |
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state = 1)
# Fitting the model
gbc.fit(x_train, y_train)
# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, x_test, y_test)
gbc_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.789661 | 0.648458 | 0.868481 |
# XGBoost Classifier
xgb = XGBClassifier(random_state = 1, eval_metric = 'logloss')
# Fitting the model
xgb.fit(x_train,y_train)
# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb,x_test,y_test)
xgb_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.786274 | 0.666882 | 0.870748 |
Observations
- All three boosting models were tested on the same dataset to compare performance in predicting employee attrition.
- AdaBoost Classifier achieved a test accuracy of 0.8617, with recall = 0.5989, showing good balance but lower sensitivity than others.
- Gradient Boosting Classifier slightly outperformed AdaBoost with a test accuracy of 0.8685 and a better recall of 0.6485, making it more effective at detecting attrition cases.
- XGBoost Classifier yielded the best recall of the three at 0.6688, with a test accuracy of 0.8707, combining robust predictive power with strong sensitivity to the positive class.
- All boosting models demonstrate comparable precision (~0.78–0.80), but XGBoost shows superior balance between recall and accuracy.
Conclusion:
- XGBoost is the most reliable of the three boosting models in this context, offering the best trade-off between recall and accuracy on the test set.
- This makes XGBoost especially suitable for applications where identifying attrition cases (class 1) is a business priority.
Hyperparameter Tuning: Boosting¶
Hyperparameter tuning searches for the parameter values that give the best model performance. Keep in mind that as the size of the data increases, the computation time required for the search grows as well.
- For practice purposes, we have listed below some of the important hyperparameters for each algorithm that can be tuned to improve the model performance.
- Adaboost
Some important hyperparameters that can be tuned:
base_estimator: object, default = None. The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, the base estimator is a DecisionTreeClassifier initialized with max_depth=1.
n_estimators: int, default = 50. The maximum number of estimators at which boosting is terminated. In the case of a perfect fit, the learning procedure is stopped early.
learning_rate: float, default = 1.0. The weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier.
For a better understanding of each parameter in the AdaBoost classifier, please refer to this source.
- Gradient Boosting Algorithm
Some important hyperparameters that can be tuned:
n_estimators: The number of boosting stages that will be performed.
max_depth: Limits the number of nodes in the tree. The best value depends on the interaction of the input variables.
min_samples_split: The minimum number of samples required to split an internal node.
learning_rate: How much the contribution of each tree will shrink.
loss: The loss function to optimize.
For a better understanding of each parameter in the Gradient Boosting classifier, please refer to this source.
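To see how learning_rate shrinks each stage's contribution, here is one gradient-boosting stage for squared error on toy numbers, with a hypothetical two-leaf stump standing in for a real regression tree:

```python
# One gradient-boosting stage for squared error, with learning-rate shrinkage.
y = [3.0, 5.0, 10.0, 12.0]
lr = 0.1  # learning_rate: shrinks each stage's contribution

f0 = sum(y) / len(y)                  # stage 0: predict the mean (7.5)
residuals = [yi - f0 for yi in y]     # errors the next learner must fix

# Hypothetical stump: left leaf = first two samples, right leaf = last two,
# each predicting its leaf's mean residual
left = sum(residuals[:2]) / 2         # -3.5
right = sum(residuals[2:]) / 2        #  3.5
h1 = [left, left, right, right]

f1 = [f0 + lr * h for h in h1]        # shrunken update: F1 = F0 + lr * h1

mse0 = sum((yi - f0) ** 2 for yi in y) / len(y)
mse1 = sum((yi - fi) ** 2 for yi, fi in zip(y, f1)) / len(y)
# mse1 < mse0: each shrunken stage nudges predictions toward the targets
```

With lr = 0.1 each stage moves only 10% of the way toward the residuals, which is why a small learning_rate usually needs more n_estimators.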
- XGBoost Algorithm
Some important hyperparameters that can be tuned:
- booster [default = gbtree]: Which booster to use. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models while gblinear uses linear functions.
- min_child_weight [default = 1]: The minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process will give up further partitioning. The larger min_child_weight is, the more conservative the algorithm will be.
For a better understanding of each parameter in the XGBoost Classifier, please refer to this source.
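No tuning cells for the boosters appear in this notebook, but the hyperparameters listed above could be searched with the same GridSearchCV-plus-recall-scorer pattern used for the Random Forest. The grids below are illustrative starting points only, not values validated on this dataset:

```python
# Hypothetical search grids mirroring the hyperparameters described above;
# they would plug into GridSearchCV with the class-1 recall scorer.
params_ada = {"n_estimators": [50, 100, 200],
              "learning_rate": [0.1, 0.5, 1.0]}

params_gbc = {"n_estimators": [100, 250],
              "max_depth": [3, 5],
              "min_samples_split": [2, 10],
              "learning_rate": [0.05, 0.1]}

params_xgb = {"n_estimators": [100, 250],
              "learning_rate": [0.05, 0.1],
              "min_child_weight": [1, 5],
              "booster": ["gbtree"]}
```

Smaller learning rates generally pair with larger n_estimators, so grids like these are best kept coarse at first and refined around the winning region.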
Comparison of all the models we have built so far
models_test_comp_df = pd.concat([dtree_test.T,
dtree_tuned_test.T,
rf_estimator_test.T,
rf_estimator_tuned_test.T,
adaboost_model_perf_test.T,
gbc_perf_test.T,
xgb_perf_test.T], axis = 1)
models_test_comp_df.columns = ["Decision Tree classifier",
"Tuned Decision Tree classifier",
"Random Forest classifier",
"Tuned Random Forest classifier",
"Adaboost classifier",
"Gradientboost classifier",
"XGBoost classifier"]
print("Test performance comparison:")
Test performance comparison:
models_test_comp_df
| Decision Tree classifier | Tuned Decision Tree classifier | Random Forest classifier | Tuned Random Forest classifier | Adaboost classifier | Gradientboost classifier | XGBoost classifier | |
|---|---|---|---|---|---|---|---|
| Precision | 0.609450 | 0.556216 | 0.814815 | 0.753133 | 0.808492 | 0.789661 | 0.786274 |
| Recall | 0.627198 | 0.603388 | 0.546593 | 0.661477 | 0.598877 | 0.648458 | 0.666882 |
| Accuracy | 0.775510 | 0.582766 | 0.850340 | 0.861678 | 0.861678 | 0.868481 | 0.870748 |
Conclusion¶
Final Model Comparison Observations (Test Set)
- Among all models, XGBoost Classifier delivers the highest test accuracy (0.8707) and the best recall (0.6688), making it the top performer overall.
- Gradient Boosting Classifier follows closely with an accuracy of 0.8685 and recall of 0.6485, showing robust performance across all metrics.
- Both AdaBoost and Tuned Random Forest models provide a strong balance, each achieving an accuracy of 0.8617, though XGBoost still edges them out in recall.
- Untuned Random Forest Classifier demonstrates high precision (0.8148) but suffers from low recall (0.5469), indicating it's overly conservative in predicting attrition.
- Decision Tree (tuned or not) significantly underperformed in comparison, with lower accuracy and recall (e.g., tuned DT recall = 0.6034, accuracy = 0.5828).
- In general, boosting methods outperformed basic tree models and random forests in both recall and accuracy, highlighting their effectiveness on this classification task.
Recommendation:
- XGBoost should be the preferred model for deployment due to its superior balance of recall and accuracy, which is crucial for correctly identifying employees likely to leave.
notebook_path = '/content/drive/MyDrive/My DS DA/Employee Attrition/Employee Attrition Prediction.ipynb'
!jupyter nbconvert --to html "{notebook_path}"
from google.colab import files
files.download('/content/drive/MyDrive/My DS DA/Employee Attrition/Employee Attrition Prediction.html')